This pipeline was created using the following resources:
Within the RStudio file (.rmd), each code chunk can
be executed by clicking the Run button within each
chunk or by placing the cursor inside it and pressing
Cntrl+Shift+Enter.
After installing R and RStudio, the following
Bioconductor R packages are necessary for processing RNASeq raw data,
conducting differential expression analysis and data visualization:
For installing packages, execute the code on the
Installation section in the corresponding Bioconductor
page.
For instance, for installing DESeq2:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
The remaining packages are part of the Comprehensive R Archive
Network (CRAN), which is the official R packages repository. These can
be installed in RStudio by going to Tools →
Install Packages and in the Install
from option select Repository (CRAN), followed
by specifying the intended packages:
- ggplot2
- Data visualization and export options
- ggalt
- Adds extra data visualization flexibility
- ggrepel
- Adds extra data visualization flexibility
- textshaping
- Necessary for EnhancedVolcano package
- tidyverse
- Data processing and visualization
- RColorBrewer
- Color schemes for data visualization
- circlize
- Necessary for ColorRamp scheme in heatmap data
visualization
- pheatmap
- Basic heatmap data visualization
After installing every required package, these libraries should
be loaded into RStudio so that their commands and functions are
available to use. To do so, enter:
#Loading libraries
library("DESeq2")
library("edgeR")
library("AnnotationDbi")
library("org.Hs.eg.db")
library("ggplot2")
library("ggalt")
library("ggrepel")
library("textshaping")
library("tidyverse")
library("RColorBrewer")
library("circlize")
library("pheatmap")
library("ComplexHeatmap")
library("EnhancedVolcano")
Note: Libraries should always be re-loaded when
re-opening the script.
Afterwards, open the raw count data file in RStudio. This file is a
table that results from mapping the sequencing reads to a reference
genome and displays the number of reads assigned to each gene, in each
sample. This tab-delimited file contains a header row with each sample
name and each row name represents a unique Ensembl gene ID (ENSG).
In this work, the file “2D_counts.txt” was used for
differential gene expression analysis of 2D Native
hASCs vs 2D Glycoengineered hASCs. For
assessing the impact of cell-cell tethering in the transcriptome,
2D Glycoengineered hASCs vs 3D
Cellgels samples were pooled into
“2Dv3D_counts.txt”.
For ease of comprehension, each function is described
line-by-line in commented descriptions. For further information, please
refer to each package’s vignette containing the comprehensive command
list.
First, RNASeq analysis in 2D conditions was
conducted:
#Reads count table and assigns it to a variable called "Counts".
Counts <- read.table("2D_counts.txt", header = TRUE, row.names = 1, sep ="\t")
Counts
Note: Before starting the analysis, confirm
that the raw counts table contains the actual unprocessed reads
and not the output of a previous DESeq2 workflow. Raw count values are
stored as integers, whereas normalized counts are decimals numbers.
DESeq2 differential expression analysis can only be conducted on raw
unprocessed counts.
#Creates data frame called "coldata" that assigns experimental conditions ("2DControl" or "2DGly") to each sample and stores this factor in the "treatment" column. Also renames the original sample names in the .bam file to shorter and convenient IDs.
coldata <- DataFrame(id = c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5"), treatment= factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)), levels = c("2DControl", "2DGly")))
#Saves the formatted counts table to a .txt file.
write.table(Counts, file = "2D_counts_v2.txt", sep = "\t", quote=FALSE, col.names = TRUE)
coldata
DataFrame with 10 rows and 2 columns
id treatment
<character> <factor>
1 2DC1 2DControl
2 2DC2 2DControl
3 2DC3 2DControl
4 2DC4 2DControl
5 2DC5 2DControl
6 2DG1 2DGly
7 2DG2 2DGly
8 2DG3 2DGly
9 2DG4 2DGly
10 2DG5 2DGly
Next, a pre-filtering strategy is applied to
the Counts table. Before proceeding with differential
expression analysis, it is important to filter out genes that
consistently present extremely low expression. This reduces multiple
testing burden and helps in increasing the statistical power of the
analysis, while keeping genes of interest.
Herein, counts per million (CPM) are used for
pre-filtering rather than raw counts,
as CPM accounts for differences in library
sizes (i.e., sequencing depth = total number of counts) across
samples, ensuring a more robust comparison. For instance, if samples
yielded uneven library sizes (e.g. 15 M vs 40 M), a filtering criteria
solely based on raw counts would not avoid favoring genes that are
expressed in the larger library. CPM is calculated
through edgeR as the raw counts divided by
library size and multiplied by one million.
Note: CPM calculation through
edgeR package is more sophisticated than manually
applying the formula as it also automatically corrects for library
composition using calcNormFactors.
As a rule of thumb, for library sizes of about 20 M
reads, a CPM = 0.5 is used as it corresponds to a
raw count of 10 - 15.
Note: To view the library size for each sample
enter:
library_size <- colSums(Counts)
library_size.df <- as.data.frame(library_size)
rownames(library_size.df) <- coldata$id
library_size.df
#Cross-checking whether a CPM threshold of 0.5 corresponds to approximately 10 - 15 counts in these samples.
myCPM_colnames = c("C1", "C2", "C3", "C4", "C5", "G1", "G2", "G3", "G4", "G5")
myCPM <- cpm(Counts) #edgeR calculates the Counts Per Million (CPM) values for the Raw Counts object.
plot(myCPM[,1], Counts[,1], xlab="CPM", ylab="Raw Counts", main=paste("Sample", myCPM_colnames[1]),
ylim=c(0,50), xlim=c(0,3))
#Adds reference lines at x = 0.5 CPM and y = 10 Raw Counts
abline(v=0.5) + abline(h=10)

Considering library sizes in this dataset are on the order of 20 M, a
threshold for genes with CPM > 0.5 was established.
An additional requirement for expression in 3 or more
libraries is used, as each group contains 5 replicates.
Applying a stringent pre-filtering threshold ensures a
more reliable and robust analysis. This criteria eliminates genes that
are unlikely to be relevant due to (i) inconsistent
reads across different samples, and (ii) increased fold
change and variance associated with extremely low counts. In addition,
this setup ensures that a gene will be retained even if it is only
expressed in one experimental condition.
#Pre-filtering Counts to CPM > 0.5 in at least 3 different samples
keepCPM <- rowSums(cpm(Counts) > 0.5) >= 3
CPMCounts <- Counts[keepCPM,]
nrow(Counts) #61552 = Total number of mapped genes
nrow(CPMCounts) #14194 = Number of genes after pre-filtering
#Running DESeq2 analysis
ddsCPM <- DESeqDataSetFromMatrix(countData = CPMCounts, colData = coldata, design = ~treatment)
ddsCPM <- DESeq(ddsCPM)
resddsCPM <- results(ddsCPM)
resddsCPM #Show DESeq2 results table
log2 fold change (MLE): treatment 2DGly vs 2DControl
Wald test p-value: treatment 2DGly vs 2DControl
DataFrame with 14194 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000160072 217.7990 0.2211119 0.138383 1.597820 0.1100829 0.2225651
ENSG00000142611 66.1508 -0.4467950 0.178257 -2.506466 0.0121945 0.0393714
ENSG00000225630 525.1312 0.0389046 0.147856 0.263125 0.7924545 0.8773250
ENSG00000131584 42.3183 0.4723784 0.288730 1.636059 0.1018273 0.2098193
ENSG00000169972 130.7999 0.0926638 0.124668 0.743283 0.4573102 0.6180456
... ... ... ... ... ... ...
ENSG00000210196 52.4841 -0.1962727 0.203209 -0.965866 0.334111 0.499162
ENSG00000276256 89.2431 0.1866894 0.135644 1.376321 0.168722 0.306774
ENSG00000273748 44.4494 0.1237211 0.194458 0.636234 0.524624 0.677302
ENSG00000276197 12.6270 0.0673584 0.571645 0.117833 0.906200 0.946688
ENSG00000271254 50.1216 -0.1487663 0.201549 -0.738116 0.460444 0.620557
#Output DESeq2 Normalized Counts - Necessary for submitting RNASeq datasets to the SRA repository.
normalized_counts <- counts(ddsCPM, normalized=TRUE)
write.table(normalized_counts, file = "2D_normalizedcounts.txt", sep = "\t", quote=FALSE, col.names = TRUE)
Next, to improve interpretability, it is useful to convert the
mapped ENSEMBL IDs (e.g., ENSG00000115414) to the more
recognizable Gene Symbols (e.g., FN1 for the
Fibronectin gene).
#Adds a new column "geneID" to the results table containing the corresponding Gene Symbols to each row
GeneIDs <- mapIds(org.Hs.eg.db,
keys = row.names(resddsCPM),
keytype = "ENSEMBL",
column = "SYMBOL")
ENS_geneIDs <- stack(GeneIDs)
resddsCPM$geneID <- ENS_geneIDs$values
resddsCPM #Show the updated DESeq2 results table with the new geneID column
log2 fold change (MLE): treatment 2DGly vs 2DControl
Wald test p-value: treatment 2DGly vs 2DControl
DataFrame with 14194 rows and 7 columns
baseMean log2FoldChange lfcSE stat pvalue padj geneID
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character>
ENSG00000160072 217.7990 0.2211119 0.138383 1.597820 0.1100829 0.2225651 ATAD3B
ENSG00000142611 66.1508 -0.4467950 0.178257 -2.506466 0.0121945 0.0393714 PRDM16
ENSG00000225630 525.1312 0.0389046 0.147856 0.263125 0.7924545 0.8773250 MTND2P28
ENSG00000131584 42.3183 0.4723784 0.288730 1.636059 0.1018273 0.2098193 ACAP3
ENSG00000169972 130.7999 0.0926638 0.124668 0.743283 0.4573102 0.6180456 PUSL1
... ... ... ... ... ... ... ...
ENSG00000210196 52.4841 -0.1962727 0.203209 -0.965866 0.334111 0.499162 NA
ENSG00000276256 89.2431 0.1866894 0.135644 1.376321 0.168722 0.306774 LOC389831
ENSG00000273748 44.4494 0.1237211 0.194458 0.636234 0.524624 0.677302 NA
ENSG00000276197 12.6270 0.0673584 0.571645 0.117833 0.906200 0.946688 NA
ENSG00000271254 50.1216 -0.1487663 0.201549 -0.738116 0.460444 0.620557 LOC102724250
Note: Gene symbols (‘geneID’ column) with
NA values are typically pseudogenes, some of those associated
with producing antisense RNA transcripts. Not every ENSEMBL Gene ID
corresponds to a HGNC Gene Symbol entry.
#Output full results: Normalized counts + DESeq2 results table
results_sum <- cbind(normalized_counts, resddsCPM)
#write.csv(results_sum, file = "2D_full_results.csv", quote=FALSE, col.names = TRUE) #.CSV alternative to upload data
write.table(results_sum, file = "2D_full_results.txt", sep = "\t", quote=FALSE, col.names = TRUE) #.TXT for better data fetching
After DESeq2 processing, differential expression analysis of
genes can be conducted. In this work, genes are considered to be
Differentially Expressed Genes (DEGs) if they exhibit
an adjusted p-value < 0.05 (padj,
Benjamini-Hochberg procedure for multiple testing correction) and
present log2(fold change) > 1 or <
-1.
Specifically, a log2(fold change) > 1 translates
into an > 2-fold upregulated gene expression,
whereas a log2(fold change) < -1 indicates a
< 0.5 fold downregulation. By combining both of
these thresholds, this robust criteria provides greater specificity for
identifying genes that are not only statistically significant, but are
also more likely to be biologically relevant.
#Analysis of DEGs distribution for Donut Plot
resddsCPM <- resddsCPM[!is.na(resddsCPM$padj), ] #Clean-up of genes containing 'NA' in 'padj' column. This occurs in genes filtered out by DESeq2 through automatic independent filtering or containing a sample with an extreme count outlier, as detected by Cook's distance.
sigs <- resddsCPM[resddsCPM$padj < 0.05,] #Creates a subset of genes that meet the padj < 0.05 threshold.
sigsALL <- sigs[abs(sigs$log2FoldChange) > 1,] #Total DEGs, use of absolute log2fold change values ('abs').
sigsUP <- sigs[sigs$log2FoldChange > 1,] #Upregulated DEGs.
sigsDOWN <- sigs[sigs$log2FoldChange < -1,] #Downregulated DEGs.
#Summary
#dim(resddsCPM) #Total genes = 14193
#dim(sigs) #Total statistically significant genes = 4692
#dim(sigsALL) #Total DEGs = 568
#dim(sigsUP) #Upregulated DEGs = 180
#dim(sigsDOWN) #Downregulated DEGs = 388
#Output DEGs
sigsALL.df <- as.data.frame(sigsALL)
sigsALL.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigsALL.df), keytype = "ENSEMBL", column = "SYMBOL") #Alternative way to add Gene Symbols to each row, available upon conversion to a data.frame object.
write.csv(sigsALL.df, "2D_DEGs.csv", quote = FALSE, col.names = TRUE)
#Output Volcano table
write.csv(resddsCPM, "2D_Volcano.csv", quote = FALSE, col.names = TRUE)
Next, the quality of the RNASeq dataset can be validated by
plotting the per-gene dispersion estimates versus the mean
expression. The visualization of this plot
(plotDispEsts) allows for rapidly checking whether the
fitted mean-dispersion relationship (red trend
line) shows an appropriate fit to the data and
ensure the validity of differential testing assumptions
in downstream analysis.
#Plot fitted mean-dispersion relationship
plotDispEsts(ddsCPM)

In addition, it is important to perform a scatterplot of the
biological coefficient of variation
(BCV) against the average abundance of each gene. A
BCV plot provides further visualization of the
biological variability across samples in terms of gene
expression, and whether this variance is adequate for fitting.
As found below, the common dispersion (Disp) is
0.01331 and the BCV (square root of
Disp) is 0.1154. Typically, the
asymptotic value for BCV tends to be within the
0.05 - 0.2 range for cell line experiments or
genetically identical mice, whereas larger values (>
0.3) can be found in human subjects datasets. As a result, this
BCV analysis further confirms the validity of the
dataset for subsequent downstream analysis, as it will be demonstrated
below.
#edgeR processing for fitting and calculations
condition = factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)))
dgeGlm <- DGEList(counts = CPMCounts, group = condition)
str(dgeGlm)
names(dgeGlm)
dgeGlm[["samples"]]
#Calculate common dispersion (Disp) and biological coefficient of variation (BCV)
design <- model.matrix(~condition)
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE) #Disp = 0.01331, BCV = 0.1154
Disp = 0.01331 , BCV = 0.1154
#Plotting BCV scatterplot
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)

RNASeq datasets often contain thousands of genes across multiple
samples, making it extremely complex to detect and analyze patterns of
gene expression data. In this context, Principal Component
Analysis (PCA) transforms data into a set of
uncorrelated variables (i.e., principal components), which is useful for
reducing data dimensionality, providing a blind and
unbiased assessment of gene expression patterns across
multiple samples and conditions. PC1 describes the most variation within
the data, PC2 the second most, and so forth.
Therefore, PCA is highly valuable for
(i) identifying gene expression trends
(e.g., confirming the biological relevance of the treatment/disease or
experimental condition), (ii) identifying
clusters of samples and their relationships, and
(iii) screening out any potential
outliers or batch effects. Visualizing
these effects assists in confirming the hypothesis tested in further
DEGs downstream analysis.
To create a PCA plot, first a vst()
(i.e., variance-stabilizing transformation) function should be
applied to remove the dependence of the variance on the mean. This is
because analysis of multi-dimensional data (such as PCA) works best for
homoskedastic data (i.e., consistent variability), whereas in RNASeq
data variance increases with the mean values. Unlike the other DESeq2
function (i.e., rlog()) function for this purpose,
vst() is more sensitive to size factor and also
normalizes with respect to library size.
#Output 2D PCA Plot
vsdata <- vst(ddsCPM, blind = FALSE) #Apply variance-stabilizing transformation to DESeq object 'ddsCPM'
#By default both these functions are **blind** to the sample design formula given to DESeq2 in DESeqDataSetFromMatrix(), however this is not appropriate if one expects large differences in counts explained by differences in experimental design. In such cases, the blind parameter should be set to 'FALSE' as recommended in DESeq2 vignette.
#plotPCA(vsdata, intgroup = "treatment") #Preview PC1 and PC2 % for correct axis labeling.
PCAraw <- plotPCA(vsdata, intgroup = "treatment") #Create a PCA plot using 'vsdata' and groups it by 'treatment'.
group.colors <- c("#619CFF","#00C19F") #Vector containing color codes to be used in each group.
new_pca <- ggplot(PCAraw[["data"]], #Create a ggplot using 'data' data.frame within 'PCAraw' object.
aes(PC1, PC2, color=group)) + #Construct aesthetic mapping using PC1 and PC2.
scale_color_manual(values = group.colors) + #Specify a vector of colors.
geom_point(size = 2.5) + #Change marker size.
xlab("PC1 (79%)") + #Text used for X axis.
ylab("PC2 (9%)") + #Text used for Y axis.
ylim(-10, 10) + #Lower/upper limits for Y axis.
xlim(-12, 12) + #Lower/upper limits for X axis.
theme_grey() + #Set a grey background with white gridlines.
theme(axis.title.x = element_text(size = 14), #Customize size and style of axis titles and text.
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 12, face = "bold"))
new_pca
ggsave("2D_PCA.svg", new_pca, dpi = 600, width = 2.6, height = 2, units = "in", scale = 2) #Exports plot in .svg

Apart from PCA, another important analysis to
perform is to determine whether intergroup variability
- which represents differences between experimental conditions in
comparison with control conditions - is greater than the
intragroup variability, representing technical or
biological variance. This is achieved by calculating
distance as represented by correlation across samples,
typically using the Pearson’s correlation plot.
Pearson’s coefficient reflects the linear
relationship between 2 variables accounting for differences in mean and
standard deviation. For instance, the more similar the gene
expression profiles for all transcripts are between two
samples, the higher the correlation coefficient will
be. These coefficients are calculated between all samples and are then
visualized as a heatmap.
Similar to PCA, the Pearson’s correlation plot
provides an overview of the overall strength of the biological
effect of the experiment, assessing whether replicates and
different conditions group together, as well as identify outliers that
were not excluded during upstream steps (e.g., sample labeling or
genomic alignment).
#Output 2D Pearson's Correlation Plot
vsdata <- vst(ddsCPM, blind = FALSE) #Apply variance-stabilizing transformation.
sampleDists <- cor(method = "pearson", assay(vsdata)) #Calculate Pearson correlation coefficients.
sampleDistMatrix <- as.matrix(sampleDists) #Convert to matrix format.
rownames(sampleDistMatrix) <- paste(vsdata$id) #Label Sample names.
colnames(sampleDistMatrix) <- paste(vsdata$id) #Label Sample names.
colors <- colorRampPalette(brewer.pal(9, "Blues"))(255) #Color palette range for heatmap visualization.
SampleDistance = pheatmap(sampleDistMatrix, col=colors) #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
SampleDistance #ggsave("SD_2D_Pearson.svg", SampleDistance, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2)

#Alternative: Euclidean Distances Correlation Plot, as described in DESeq2 vignette
sampleDists_Euclidean <- dist(t(assay(vsdata))) #Calculate Pearson correlation coefficients.
sampleDistMatrix_Euclidean <- as.matrix(sampleDists_Euclidean) #Convert to matrix format.
rownames(sampleDistMatrix_Euclidean) <- paste(vsdata$id) #Label Sample names.
colnames(sampleDistMatrix_Euclidean) <- paste(vsdata$id) #Label Sample names.
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255) #Color palette range for heatmap visualization.
SampleDistancevignette = pheatmap(sampleDistMatrix_Euclidean, #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
clustering_distance_rows=sampleDists_Euclidean, #Clustering by Euclidean distance.
clustering_distance_cols=sampleDists_Euclidean, #Clustering by Euclidean distance.
col=colors)
SampleDistancevignette #ggsave("2D_SD.svg", SampleDistancevignette, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2)

Next, a Volcano plot can be assembled to
provide a general overview of the gene expression profiles of 2D Native
hASCs vs 2D Glycoengineered hASCs. Volcano plots
provide a clear and intuitive visualization of statistical
significance and the extent of fold changes
for every gene detected. Here, each gene is plotted with their
statistical significance (adjusted p-values on the y-axis) agaisnt the
fold change (log2 fold change on the x-axis).
Then, a threshold is set to highlight genes that are
differentially expressed (DEGs), as represented in the
plot in the form of dashed lines. As aforementioned, this work considers
as DEGs genes that present an adjusted p-value
< 0.05 and log2(fold change) > 1 or
< -1. This combination of significance and
fold change information allows researchers to efficiently
identify and prioritize genes
of interest for further investigation. In addition,
Volcano plots can also assist in screening out genes
with high statistical significance but low fold changes, which may be
significant due to large sample sizes but may not have substantial
biological relevance.
#Output 2D Volcano plot with specific Genes of Interest labelled
resddsCPM.df <- as.data.frame(resddsCPM) #Convert DESeq2 results object to a data frame.
#Create custom key-value pairs for 'Up', 'Down', 'Non-DEG' expression. Achieved using nested 'ifelse' statements.
keyvals <- ifelse(resddsCPM$padj > 0.05, 'grey30', #If padj > 0.05 assign grey color.
ifelse(resddsCPM$log2FoldChange < -1, 'royalblue', #If log2foldchange < -1, assign blue color.
ifelse(resddsCPM$log2FoldChange > 1, 'firebrick3', #If log2foldchange > 1, assign red color.
'grey30'))) #Other values assigned grey color.
#Rename elements in 'keyvals' vector:
names(keyvals)[keyvals == 'firebrick3'] <- 'Up' #Assign 'Up' to elements in red.
names(keyvals)[keyvals == 'grey30'] <- 'Non-DEG' #Assign 'Non-DEG' to elements in red.
names(keyvals)[keyvals == 'royalblue'] <- 'Down' #Assign 'Up' to elements in red.
#Define vector containing Gene Symbols to highlight in plot
selected = c("PTGS2", "KRT19", "CXCL1", "SLC7A11", "NDNF", "PGD", "VCAM1", "GPC1", "ANGPT1", "GDF15")
#Volcano plot settings
enhvolc <- EnhancedVolcano(resddsCPM.df, #Create Volcano plot using 'resddsCPM.df'
x = "log2FoldChange", #Column name containing log2(fold change) values.
y = "padj", #Column name containing adjusted p-value values.
#lab = NA, #Uncomment this function for unlabeled genes.
pCutoff = 0.05, #Cut-off threshold for statistical significance.
FCcutoff = 1, #Cut-off threshold for absolute log2(fold change).
title = NULL, #Plot title.
subtitle = NULL, #Plot subtitle.
caption = NULL, #Plot caption.
xlab = bquote(~Log[2]~"(Fold Change)"), #Label for x-axis.
ylab = bquote(~-Log[10] ~ "(" * italic(P[adj]) * ")"), #Label for y-axis.
colAlpha = 0.5, #Alpha for color transparency.
colCustom = keyvals, #Override color scheme (e.g. color by pathway, cell-type or condition).
pointSize = 1, #Size of plotted points.
labSize = 4, #Size of labels for each point.
drawConnectors = TRUE, #Connect labels to each corresponding point by line connectors.
widthConnectors = 0.65, #Line width of connectors.
arrowheads = FALSE, #Draw arrowheads instead of simple line connectors.
boxedLabels = TRUE, #Draw labels within boxes.
directionConnectors = 'both', #Direction to draw connectors (e.g., 'y', 'x' or 'both').
xlim = c(-6, 6), #Limits of x-axis.
ylim = c(0, -log10(10e-255)), #Limits of y-axis.
#Exclude these if not highlighting specific genes
lab = resddsCPM.df$geneID, #Input Gene Symbol column of data.frame object.
selectLab = selected #Vector containing a subset of genes of interest.
)
#Customize axis limits and breaks
p1 <- enhvolc +
ggplot2::coord_cartesian(xlim = c(-6.4, 6.4), #Zoom plot to the specific x-axis limit.
ylim = c(0, 250)) + #Zoom plot to the specific y-axis limit.
ggplot2::scale_x_continuous(breaks=seq(-6, 6, 2)) + #Set spacing of breaks in plot x-axis.
ggplot2::scale_y_continuous(breaks=seq(0, 250, 50)) #Set spacing of breaks in plot y-axis.
p1
ggsave("2D_Volcano_Wide_Label.svg", p1, width = 4.5, height = 2.8, units = "in", scale = 2)

Alternatively, Volcano plots can be customized
via other packages (e.g., ‘ggplot2’) as subplot overlays for
presenting a list of target genes with different shading
specifications.
For instance, overlaying all genes that
match a specific requirement (i.e., all ILs genes).

In addition, it is also possible to assemble a Volcano
plot containing an overlay of genes of
interest on top.

Alternatively, it is possible to assemble a MA
plot, which conveys a similar overview to Volcano
plots. However, this plot lacks the design flexibility that
Volcano plots offer through their dedicated R packages.
#Alternative to Volcano - MA Plot. Less flexibility in customizing the final figure.
MAPlot_dds <- DESeq2::plotMA(resddsCPM, alpha = 0.05, #Set p-value cut-off threshold.
colNonSig = "gray30", #Set color code for Non-DEG.
colSig = "royalblue3", #Set color code for statistically significant.
ylim = c(-7, 7)) #Set y-axis limit.
abline(h = c(1, -1), lty = 2) #Draw horizontal lines at the specified y-axis values, line type - dashed.

Apart from Volcano plots,
Heatmaps provide an important visualization tool in
RNASeq for detecting gene expression trends and highlighting genes with
relevant biological functions. Herein, each sample is assigned to a
column, and each row corresponds to a DEG with
color-coded expression level. Expression values are
typically normalized and transformed (i.e., row
z-scaling) to ensure that genes with varying expression levels
can be compared easily. Heatmaps are usually paired
with hierarchical clustering algorithms (e.g.,
Pearson’s) that group together genes and samples with similar expression
profiles, generating a hierarchical tree-like dendrogram.
Heatmaps are thus useful in RNASeq downstream analysis
for revealing co-expression patterns and
biological functions affected by experimental
conditions.
#Transformation of data for Heatmap
vsdata <- vst(ddsCPM, blind = FALSE) #Apply variance-stabilizing transformation.
vst_mat <- assay(vsdata) #Create matrix with transformed gene expression values.
vst_mat <- vst_mat[rownames(sigsALL.df), ] #Subset matrix to include only DEGS, i.e., rows that match genes listed in 'sigsALL.df'.
Genename <- mapIds(org.Hs.eg.db, #Map Ensembl ID from matrix to Gene Symbol and store in variable 'Genename'.
keys = row.names(vst_mat),
keytype = "ENSEMBL",
column = "SYMBOL")
#Replace NA gene symbols with ENSEMBL IDs in vst_mat.
#(i) Identify which entries in 'Genename' have missing values (NA).
no_symbol_idx <- is.na(Genename)
#(ii) For 'Genename' entries with NA, replace values with original Ensembl ID extracted from 'vst_mat' row names.
Genename[no_symbol_idx] <- row.names(vst_mat)[no_symbol_idx]
#(iii) Convert 'Genename' vector into a single column data.frame, making it suitable for assignment to row names of 'vst_mat'
ENS_geneIDs <- stack(Genename)
#(iv) Replace Ensembl ID entries with Gene Symbols when available, and retains original Ensembl ID if NA.
row.names(vst_mat) <- ENS_geneIDs$values
#Double-check that no DEGs were lost during assignment
#dim(vst_mat) #568 DEGs -> TRUE
#Scale genes z-score by row, as otherwise large values from highly expressed genes would dominate the plot
vst_mat_z <- t(scale(t(vst_mat)))
#Assign column names to each sample
colnames(vst_mat_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")
#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2), #Create segments for the color gradient.
c("#2166AC", "white", "#B2182B")) #Create color gradient with the specified color codes.
lgd_list <- list(title = "Z-score", #Set legend title.
legend_height = unit(2.5, "cm"), #Set height of the legend body.
grid_width = unit(0.4, "cm"), #Set width of each legend grid.
title_position = "topcenter", #Set position of the title.
gap = unit(4, "mm")) #Set gap between heatmap and legend.
#Output full DEGs Heatmap
ht <- Heatmap(vst_mat_z, #Matrix input for heatmap.
col = col_fun, #Set vector of colors for interpolation.
show_column_names = FALSE, #Whether to show column labeling.
show_row_names = FALSE, #Whether to show row labeling.
#row_labels = sigs.df[rownames(vst_mat_z),]$Symbol, #Uncomment to specify row labeling.
row_names_gp = gpar(fontsize = 5), #Set font size for row labeling.
clustering_distance_columns = "pearson", #Set clustering method for column dendrogram.
clustering_distance_rows = "pearson", #Set clustering method for row dendrogram.
#column_labels = colnames(vst_mat_z), #Uncomment for specifying column labeling.
column_names_gp = gpar(fontsize = 5), #Set font size for column labeling.
row_dend_reorder = FALSE, #Whether to apply row dendrogram reordering.
column_dend_reorder = TRUE, #Whether to apply column dendrogram reordering.
show_row_dend = TRUE, #Whether to show row dendrogram clustering.
row_dend_width = unit(10, "mm"), #Set width of row dendrograms.
show_column_dend = FALSE, #Whether to show column dendrogram clustering.
width = unit(2, "in"), #Set heatmap width.
height = unit(10, "in"), #Set heatmap height.
name = "Z-score") #Set title.
#ht
#tiff("2D_Full_Heatmap.tiff", width = 4, height = 10, units = "in", res=600)
#draw(ht)
#dev.off()

Next, it becomes highly valuable to zoom in on
specific gene subsets that are involved in relevant
biological processes. Herein, DEGs related to
extracellular matrix organization,
glycoproteins and secreted factors are
mapped to unveil the impact of the cell surface engineering process on
the transcriptome.
#Create vectors of gene lists
Genelist_ECM = c("ADAMTS15", "WNT16", "WNT2", "WNT9A", "ANGPTL4", "BMP4", "MMRN2", "COL21A1", "COL22A1", "OMD", "TNXB", "THSD4")
Genelist_Glycoproteins = c("ADAMTS15", "APCDD1L", "ABCA6", "ABCA8", "ABCA9", "ABCB6", "ABCC2", "ABCC3", "ATP6V0E2", "BRINP1", "CEBPB", "CD200", "EPHA5", "FAM20A", "GPR137B", "GPR68", "GPRC5A", "GPRC5B", "KCNJ5-AS1", "KITLG", "KIT", "LIF", "LYPD1", "MPL", "NOX4", "SHANK2", "SLITRK4", "ST6GALNAC5", "ST6GAL2", "B3GNT8", "WNT16", "WNT2", "WNT9A", "ADCY4", "AFP", "MGAT4B", "AREG", "ANGPT1", "ANGPTL4", "AGTR1", "AQP1", "ACKR3", "B3GALT2", "B4GALNT1", "BMP4", "CDH18", "CDH6", "CDH8", "CHST6", "CDON", "CEMIP", "CHRM2", "CSGALNACT1", "COL21A1", "COL22A1", "C1RL", "CFHR1", "CNNM4", "ENPP4", "ELAPOR1", "ESM1", "EDNRB", "EFNA5", "EPOR", "GRPR", "GPC1", "GDF15", "GDF6", "HHIP", "HGF", "HMMR", "HSD11B1", "IGSF10", "IGFBP3", "ICAM1", "IFNE", "IL1RL1", "IL20RA", "IL34", "JUP", "LRRN3", "LIPG", "ME1", "MXRA5", "MMRN2", "NDNF", "NETO2", "NOTCH4", "OMD", "PLAT", "PDGFRL", "PLXDC1", "GALNT15", "GALNT3", "POPDC3", "KCNK1", "PCSK6", "PCSK9", "PTGDR", "PTGER3", "PTGS2", "PTPRQ", "PCDHB5", "RAET1E", "RNF128", "RNF150", "SPP1", "SEMA3C", "SEMA3D", "SEMA6D", "PRSS35", "SNAI1", "SLC1A2", "SLC12A9", "SLC14A1", "SLC2A12", "SLC2A5", "SLC24A3", "SLC38A1", "SLC38A4", "SLC4A8", "SLC7A11", "SLC7A14", "TNXB", "TENM4", "THBD", "TRPC6", "TMEM130", "VCAM1")
Genelist_Secreted = c("ADAMTS15", "ABCB6", "CCL26", "CXCL1", "CXCL2", "CXCL3", "CXCL6", "CXCL8", "C1QTNF2", "FAM20A", "HTRA3", "KCNJ5-AS1", "KITLG", "LIF", "WNT16", "WNT2", "WNT9A", "AKR1B10", "AFP", "ANGPT1", "ANGPTL4", "BMP4", "CEMIP", "COL21A1", "COL22A1", "C1RL", "CFHR1", "ESM1", "EDN1", "EPOR", "FABP5", "GPC1", "GDF15", "GDF6", "HHIP", "IGSF10", "IGFBP3", "IFNE", "IL1B", "IL1RL1", "IL34", "LIPG", "MXRA5", "MMRN2", "NDNF", "OMD", "PINLYP", "PLAT", "PDGFRL", "PLXDC1", "PCSK6", "PCSK9", "RAET1E", "SPP1", "SCG5", "SEMA3C", "SEMA3D", "PRSS35", "TNXB", "THSD4")
#Replace here [Genelist_ECM] with the name of the vector for specific genes c("XX", "XY", "XZ")
selected_DEGs <- Genelist_Secreted
#Filters the matrix to retain only the rows that correspond to the genes listed in 'selected_DEGs' vector.
vst_matSelected <- vst_mat[rownames(vst_mat) %in% selected_DEGs, ]
#Double check to ensure that all selected DEGs were retrieved
#dim(vst_mat) #568 DEGs
#dim(vst_matSelected) #60 DEGs
#length(Genelist_Secreted) #60 DEGs
#Apply row z-scaling
vst_matSelected_z <- t(scale(t(vst_matSelected)))
#Assign column names to each sample
colnames(vst_matSelected_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")
#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2), #Create segments for the color gradient.
c("#2166AC", "white", "#B2182B")) #Create color gradient with the specified color codes.
lgd_list <- list(title = "Z-score", #Set legend title.
title_gp = gpar(fontsize = 12), #Set font size for legend title.
legend_height = unit(2.5, "cm"), #Set height of the legend body.
legend_width = unit(5, "cm"), #Set width of the legend body.
grid_width = unit(0.4, "cm"), #Set width of each legend grid.
direction = "horizontal", #Whether scale is 'horizontal' or 'vertical'
title_position = "topcenter", #Set position of the title.
gap = unit(4, "mm")) #Set gap between heatmap and legend.
ht_selected <- Heatmap(vst_matSelected_z, #Matrix input for heatmap.
col = col_fun, #Vector of colors for interpolation.
rect_gp = gpar(col = "grey30", lwd = 0.15), #Border line color and width of each cell.
show_column_names = FALSE, #Whether to show column labeling.
show_row_names = TRUE, #Whether to show row labeling.
row_names_gp = gpar(fontsize = 5), #Set font size for row labeling.
clustering_distance_columns = "pearson", #Set clustering method for column dendrogram.
clustering_distance_rows = "pearson", #Set clustering method for row dendrogram.
#column_labels = colnames(vst_matSelected_z), #Uncomment to specify column labeling.
column_names_gp = gpar(fontsize = 5), #Set font size for column labeling.
row_dend_reorder = FALSE, #Whether to apply row dendrogram reordering.
column_dend_reorder = TRUE, #Whether to apply column dendrogram reordering.
show_row_dend = FALSE, #Whether to show row dendrogram clustering.
row_dend_width = unit(10, "mm"), #Set width of row dendrograms.
show_column_dend = FALSE, #Whether to show column dendrogram clustering.
width = unit(3, "in"), #Set heatmap width.
height = unit(3.5, "in"), #Set heatmap height.
heatmap_legend_param = lgd_list, #Reference object for legend parameters.
#show_heatmap_legend = FALSE, #Uncomment to hide heatmap legend.
name = "Z-score") #Set title.
ht_selected = draw(ht_selected, heatmap_legend_side = "bottom") #Draws heatmap with legend at the bottom position.
#Output settings for each heatmap (W = width, H = height)
#2D_ECM: W = 3 x H = 3.5 | Output svg: W = 5 x H = 5
#2D_Secreted: W = 3 x H = 6.5 | Output svg: W = 4 x H = 4.5
#2D_Glycoproteins: W = 3 x H = 3.5 | Output svg: W = 4 x H = 4.5
svg(filename = "2D_Heatmap_Secreted.svg", width = 4, height = 4.5)
draw(ht_selected, heatmap_legend_side = "bottom")
dev.off()

In large Heatmaps, in order to avoid visual
cluttering by labeling every gene, it is often valuable to label only a
subset of genes.
#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2), #Create segments for the color gradient.
c("#2166AC", "white", "#B2182B")) #Create color gradient with the specified color codes.
lgd_list <- list(title = "Z-score", #Set legend title.
title_gp = gpar(fontsize = 12), #Set font size for legend title.
legend_height = unit(2.5, "cm"), #Set height of the legend body.
legend_width = unit(5, "cm"), #Set width of the legend body.
grid_width = unit(0.4, "cm"), #Set width of each legend grid.
direction = "horizontal", #Whether scale is 'horizontal' or 'vertical'
title_position = "topcenter", #Set position of the title.
gap = unit(4, "mm")) #Set gap between heatmap and legend.
#Create vector with subset of genes to label
Glycoproteins_select_label = c("ANGPT1", "B3GNT8", "MGAT4B", "EDNRB", "NDNF", "ADAMTS15", "ST6GALNAC5", "BMP4", "CDH8", "COL21A1", "IL34", "ICAM1", "NOTCH4", "OMD", "SEMA3D", "VCAM1")
Secreted_select_label = c("WNT2", "WNT16", "CXCL1", "CXCL2", "BMP4", "COL22A1", "EDN1", "GDF6", "IL1B", "CEMIP", "LIPG", "SCG5", "MXRA5", "IFNE", "SPP1", "ANGPT1", "NDNF", "KITLG", "AKR1B10")
#Assign genes subset to a static object for easily switching between plots
Genes_Label = Secreted_select_label
#Identify indices of the row names in the matrix that correspond to the genes listed in the vector 'Genes_Label'
mark_at = which(rownames(vst_matSelected_z) %in% Genes_Label)
#Extract the row names of the matrix that correspond to indices in 'mark_at'. This step is a double-check to ensure that the selected rows do indeed match the original 'Genes_Label' vector.
Genes_DoubleCheck <- rownames(vst_matSelected_z)[mark_at]
#Check whether the two vectors contain the same elements, regardless of their order.
setequal(Genes_DoubleCheck, Genes_Label) #TRUE -> proceed to labeling.
#Note: Use 'Genes_DoubleCheck' object for labeling, because it outputs Gene symbols in the same order as 'mark_at'. This is necessary because 'mark_at' extracts row name numbers, which are automatically ordered numerically in the output vector.
#'anno_mark' function in ComplexHeatmap package enables linking heatmap annotation with customized labels.
foo = anno_mark(at = mark_at, #Numeric index from original matrix.
#link_width = unit(5, "mm"), #Uncomment to specify label link width.
padding = 0.75, #Padding space between neighboring labels.
labels = Genes_DoubleCheck, #Corresponding labels.
labels_gp = gpar(fontsize = 10), #Sets font size.
which = "row") #Whether 'column' or 'row' annotation.
ht_selected_label <- Heatmap(vst_matSelected_z, #Matrix input for heatmap.
col = col_fun, #Vector of colors for interpolation.
rect_gp = gpar(col = "grey30", lwd = 0.15), #Border line color and width of each cell.
show_column_names = FALSE, #Whether to show column labeling.
show_row_names = FALSE, #Whether to show row labeling.
#row_names_gp = gpar(fontsize = 4), #Uncomment to specify row label font size.
#row_names_side = "left", #Uncomment to specify row label side.
clustering_distance_columns = "pearson", #Set clustering method for column dendrogram.
clustering_distance_rows = "pearson", #Set clustering method for row dendrogram.
row_dend_reorder = FALSE, #Whether to apply row dendrogram reordering.
column_dend_reorder = TRUE, #Whether to apply column dendrogram reordering.
show_row_dend = FALSE, #Whether to show row dendrogram clustering.
show_column_dend = FALSE, #Whether to show column dendrogram clustering.
width = unit(3, "in"), #Set heatmap width.
height = unit(3.5, "in"), #Set heatmap height.
heatmap_legend_param = lgd_list, #Reference object for legend parameters.
name = "Z-score") #Set title.
ht_merge = ht_selected_label + rowAnnotation(mark = foo)
ht_merge = draw(ht_merge, heatmap_legend_side = "bottom", align_heatmap_legend = "heatmap_left") #Note: 'align_heatmap_legend' = 'heatmap_center' position is offset because heatmap has been merged with 'rowAnnotation'.

svg(filename = "2D_Heatmap_Secreted_CompactLabel.svg", width = 4.5, height = 4.5)
dev.off()

---
title: "RNASeq analysis - R Pipeline"
author: "Pedro Lavrador"
output:
  html_notebook: default
  word_document: default
---

This pipeline was created using the following resources:

* [R](https://cran.r-project.org/bin/windows/base/)
* [RStudio](https://posit.co/download/rstudio-desktop/)

Within the RStudio file (**.rmd**), each code chunk can be executed by clicking the **Run** button within each chunk or by placing the cursor inside it and pressing **Cntrl+Shift+Enter**.

After installing *R* and *RStudio*, the following Bioconductor R packages are necessary for processing RNASeq raw data, conducting differential expression analysis and data visualization:

* [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) - *Differential gene expression analysis*
* [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) - *Differential gene expression analysis*
* [AnnotationDbi](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html) - *Gene ID conversion*
* [org.Hs.eg.db](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) - *Genome annotation database for Human*
* [ComplexHeatmap](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html) - *Advanced heatmap data visualization*
* [EnhancedVolcano](https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html) - *Advanced volcano plot data visualization*


For installing packages, execute the code on the **Installation** section in the corresponding Bioconductor page.

For instance, for installing **DESeq2**:

```{r Package Installation, message=FALSE, warning=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
```


The remaining packages are part of the Comprehensive R Archive Network (CRAN), which is the official R packages repository. These can be installed in RStudio by going to **Tools** → **Install Packages** and in the **Install from** option select **Repository (CRAN)**, followed by specifying the intended packages:

* [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html) - *Data visualization and export options*
* [ggalt](https://cran.r-project.org/web/packages/ggalt/index.html) - *Adds extra data visualization flexibility*
* [ggrepel](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html) - *Adds extra data visualization flexibility*
* [textshaping](https://cran.r-project.org/web/packages/textshaping/index.html) - *Necessary for EnhancedVolcano package*
* [tidyverse](https://cran.r-project.org/web/packages/tidyverse/index.html) - *Data processing and visualization*
* [RColorBrewer](https://cran.r-project.org/web/packages/RColorBrewer/index.html) - *Color schemes for data visualization*
* [circlize](https://cran.r-project.org/web/packages/circlize/index.html) - *Necessary for ColorRamp scheme in heatmap data visualization*
* [pheatmap](https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf) - *Basic heatmap data visualization*

<br>
After installing every required package, these libraries should be loaded into RStudio so that their commands and functions are available to use. To do so, enter:
```{r echo=T, results='hide'}
#Loading libraries
library("DESeq2")
library("edgeR")
library("AnnotationDbi")
library("org.Hs.eg.db")
library("ggplot2")
library("ggalt")
library("ggrepel")
library("textshaping")
library("tidyverse")
library("RColorBrewer")
library("circlize")
library("pheatmap")
library("ComplexHeatmap")
library("EnhancedVolcano")
```

**Note:** Libraries should always be re-loaded when re-opening the script.
<br>
<br>

Afterwards, open the raw count data file in RStudio. This file is a table that results from mapping the sequencing reads to a reference genome and displays the number of reads assigned to each gene, in each sample. This tab-delimited file contains a header row with each sample name and each row name represents a unique Ensembl gene ID (ENSG).

In this work, the file **"2D_counts.txt"** was used for differential gene expression analysis of **2D Native hASCs** vs **2D Glycoengineered hASCs**. For assessing the impact of cell-cell tethering in the transcriptome, **2D Glycoengineered hASCs** vs **3D Cellgels** samples were pooled into **"2Dv3D_counts.txt"**.

<br>
**For ease of comprehension, each function is described line-by-line in commented descriptions. For further information, please refer to each package's vignette containing the comprehensive command list.**

<br>
**First, RNASeq analysis in 2D conditions was conducted:**
```{r Read 2D counts}
#Reads count table and assigns it to a variable called "Counts".
Counts <- read.table("2D_counts.txt", header = TRUE, row.names = 1, sep ="\t")
Counts
```
**Note:** Before starting the analysis, **confirm that the raw counts table contains the actual unprocessed reads** and not the output of a previous DESeq2 workflow. Raw count values are stored as integers, whereas normalized counts are decimals numbers. DESeq2 differential expression analysis can only be conducted on raw unprocessed counts.

```{r Formats and saves 2D counts}
#Creates data frame called "coldata" that assigns experimental conditions ("2DControl" or "2DGly") to each sample and stores this factor in the "treatment" column. Also renames the original sample names in the .bam file to shorter and convenient IDs.
coldata <- DataFrame(id = c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5"), treatment= factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)), levels = c("2DControl", "2DGly")))

#Saves the formatted counts table to a .txt file.
write.table(Counts, file = "2D_counts_v2.txt", sep = "\t", quote=FALSE, col.names = TRUE)
```

```{r Shows coldata}
coldata
```

<br>
Next, a **pre-filtering** strategy is applied to the **Counts** table. Before proceeding with differential expression analysis, it is important to filter out genes that consistently present extremely low expression. This reduces multiple testing burden and helps in increasing the statistical power of the analysis, while keeping genes of interest.

Herein, **counts per million (CPM)** are used for **pre-filtering** rather than **raw counts**, as **CPM** accounts for differences in **library sizes** (i.e., sequencing depth = total number of counts) across samples, ensuring a more robust comparison. *For instance, if samples yielded uneven library sizes (e.g. 15 M vs 40 M), a filtering criteria solely based on raw counts would not avoid favoring genes that are expressed in the larger library.* **CPM** is calculated through **edgeR** as the **raw counts divided by library size and multiplied by one million**.

**Note:** **CPM** calculation through **edgeR** package is more sophisticated than manually applying the formula as it also automatically corrects for library composition using *calcNormFactors*.

As a rule of thumb, for **library sizes** of about 20 M reads, a **CPM = 0.5** is used as it corresponds to a **raw count** of 10 - 15.

**Note:** To view the library size for each sample enter:
```{r Shows Library size, include=TRUE}
library_size <- colSums(Counts)
library_size.df <- as.data.frame(library_size)
rownames(library_size.df) <- coldata$id
library_size.df
```


```{r CPM Thresholding Check, results=FALSE}
#Cross-checking whether a CPM threshold of 0.5 corresponds to approximately 10 - 15 counts in these samples.
myCPM_colnames = c("C1", "C2", "C3", "C4", "C5", "G1", "G2", "G3", "G4", "G5")
myCPM <- cpm(Counts) #edgeR calculates the Counts Per Million (CPM) values for the Raw Counts object.
plot(myCPM[,1], Counts[,1], xlab="CPM", ylab="Raw Counts", main=paste("Sample", myCPM_colnames[1]), 
     ylim=c(0,50), xlim=c(0,3))
#Adds reference lines at x = 0.5 CPM and y = 10 Raw Counts
abline(v=0.5) + abline(h=10)
```

Considering library sizes in this dataset are on the order of 20 M, a threshold for genes with **CPM > 0.5** was established. An additional requirement for **expression in 3 or more libraries** is used, as each group contains 5 replicates. Applying a stringent **pre-filtering** threshold ensures a more reliable and robust analysis. This criteria eliminates genes that are unlikely to be relevant due to **(i)** inconsistent reads across different samples, and **(ii)** increased fold change and variance associated with extremely low counts. In addition, this setup ensures that a gene will be retained even if it is only expressed in one experimental condition.


```{r Pre-filtering, message=FALSE, results=FALSE}
#Pre-filtering Counts to CPM > 0.5 in at least 3 different samples
keepCPM <- rowSums(cpm(Counts) > 0.5) >= 3
CPMCounts <- Counts[keepCPM,]
nrow(Counts)    #61552 = Total number of mapped genes
nrow(CPMCounts) #14194 = Number of genes after pre-filtering
```

```{r Running DESeq2, message=FALSE, results=TRUE}
#Running DESeq2 analysis
ddsCPM <- DESeqDataSetFromMatrix(countData = CPMCounts, colData = coldata, design = ~treatment)
ddsCPM <- DESeq(ddsCPM)
resddsCPM <- results(ddsCPM)
resddsCPM #Show DESeq2 results table
```
```{r Output Normalized Counts}
#Output DESeq2 Normalized Counts - Necessary for submitting RNASeq datasets to the SRA repository.
normalized_counts <- counts(ddsCPM, normalized=TRUE)
write.table(normalized_counts, file = "2D_normalizedcounts.txt", sep = "\t", quote=FALSE, col.names = TRUE)
```

<br>
Next, to improve interpretability, it is useful to convert the mapped **ENSEMBL IDs** (e.g., ENSG00000115414) to the more recognizable **Gene Symbols** (e.g., FN1 for the Fibronectin gene). 

```{r Assign Gene IDs, message=FALSE, results=TRUE}
#Adds a new column "geneID" to the results table containing the corresponding Gene Symbols to each row
GeneIDs <- mapIds(org.Hs.eg.db,
              keys = row.names(resddsCPM),
              keytype = "ENSEMBL",
              column = "SYMBOL")
ENS_geneIDs <- stack(GeneIDs)
resddsCPM$geneID <- ENS_geneIDs$values
resddsCPM #Show the updated DESeq2 results table with the new geneID column
```
**Note:** Gene symbols ('geneID' column) with *NA* values are typically pseudogenes, some of those associated with producing antisense RNA transcripts. Not every ENSEMBL Gene ID corresponds to a HGNC Gene Symbol entry.

```{r Output Full Results}
#Output full results: Normalized counts + DESeq2 results table
results_sum <- cbind(normalized_counts, resddsCPM)
#write.csv(results_sum, file = "2D_full_results.csv", quote=FALSE, col.names = TRUE) #.CSV alternative to upload data
write.table(results_sum, file = "2D_full_results.txt", sep = "\t", quote=FALSE, col.names = TRUE) #.TXT for better data fetching
```

<br>
After DESeq2 processing, differential expression analysis of genes can be conducted. In this work, genes are considered to be **Differentially Expressed Genes (DEGs)** if they exhibit an **adjusted p-value < 0.05** (**padj**, Benjamini-Hochberg procedure for multiple testing correction) and present **log2(fold change) > 1** or **< -1**.

Specifically, a **log2(fold change) > 1** translates into an **> 2-fold upregulated gene expression**, whereas a **log2(fold change) < -1** indicates a **< 0.5 fold downregulation**. By combining both of these thresholds, this robust criteria provides greater specificity for identifying genes that are not only statistically significant, but are also more likely to be biologically relevant.


```{r Output DEGs, warning=FALSE, message=FALSE, results=FALSE}
#Analysis of DEGs distribution for Donut Plot

resddsCPM <- resddsCPM[!is.na(resddsCPM$padj), ] #Clean-up of genes containing 'NA' in 'padj' column. This occurs in genes filtered out by DESeq2 through automatic independent filtering or containing a sample with an extreme count outlier, as detected by Cook's distance.

sigs <- resddsCPM[resddsCPM$padj < 0.05,]       #Creates a subset of genes that meet the padj < 0.05 threshold.
sigsALL <- sigs[abs(sigs$log2FoldChange) > 1,]  #Total DEGs, use of absolute log2fold change values ('abs').
sigsUP <- sigs[sigs$log2FoldChange > 1,]        #Upregulated DEGs.
sigsDOWN <- sigs[sigs$log2FoldChange < -1,]     #Downregulated DEGs.

#Summary
#dim(resddsCPM) #Total genes = 14193
#dim(sigs)      #Total statistically significant genes = 4692
#dim(sigsALL)   #Total DEGs = 568
#dim(sigsUP)    #Upregulated DEGs = 180
#dim(sigsDOWN)  #Downregulated DEGs = 388

#Output DEGs
sigsALL.df <- as.data.frame(sigsALL)
sigsALL.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigsALL.df), keytype = "ENSEMBL", column = "SYMBOL") #Alternative way to add Gene Symbols to each row, available upon conversion to a data.frame object.
write.csv(sigsALL.df, "2D_DEGs.csv", quote = FALSE, col.names = TRUE)

#Output Volcano table
write.csv(resddsCPM, "2D_Volcano.csv", quote = FALSE, col.names = TRUE)
```

<br>
Next, the quality of the RNASeq dataset can be validated by plotting the **per-gene dispersion estimates versus the mean expression**. The visualization of this plot (**plotDispEsts**) allows for rapidly checking whether the **fitted mean-dispersion relationship** (*red trend line*) shows an **appropriate fit** to the data and ensure the **validity** of differential testing assumptions in downstream analysis.

```{r Plot Dispersions}
#Plot fitted mean-dispersion relationship
plotDispEsts(ddsCPM)
```

<br>
In addition, it is important to perform a scatterplot of the **biological coefficient of variation** (**BCV**) against the average abundance of each gene. A **BCV plot** provides further visualization of the **biological variability** across samples in terms of gene expression, and whether this variance is adequate for fitting.


As found below, the common dispersion (**Disp**) is **0.01331** and the **BCV** (square root of **Disp**) is **0.1154**. Typically, the asymptotic value for **BCV** tends to be within the **0.05 - 0.2** range for cell line experiments or genetically identical mice, whereas larger values (**> 0.3**) can be found in human subjects datasets. As a result, this **BCV** analysis further confirms the validity of the dataset for subsequent downstream analysis, as it will be demonstrated below.


```{r edgeR Processing and BCV Plot, message=FALSE,echo=TRUE, results='hide'}
#edgeR processing for fitting and calculations
condition = factor(c(rep("2DControl", times = 5), rep("2DGly", times = 5)))
dgeGlm <- DGEList(counts = CPMCounts, group = condition)
str(dgeGlm)
names(dgeGlm)
dgeGlm[["samples"]]
```

```{r BCV value and BCV Plot, echo=TRUE}
#Calculate common dispersion (Disp) and biological coefficient of variation (BCV)
design <- model.matrix(~condition)
dgeGlmComDisp <- estimateGLMCommonDisp(dgeGlm, design, verbose = TRUE) #Disp = 0.01331, BCV = 0.1154


#Plotting BCV scatterplot
dgeGlmTrendDisp <- estimateGLMTrendedDisp(dgeGlmComDisp, design)
dgeGlmTagDisp <- estimateGLMTagwiseDisp(dgeGlmTrendDisp, design)
plotBCV(dgeGlmTagDisp)
```


```{r EdgeR SmearPlot, include=FALSE}
#edgeR alternative to Volcano Plot
fit <- glmFit(dgeGlmTagDisp, design)
colnames(coef(fit))  
lrt <- glmLRT(fit, coef = 2)  
ttGlm <- topTags(lrt, n = Inf)  
summary(deGlm <- decideTestsDGE(lrt, p = 0.05, adjust = "fdr"))
tagsGlm <- rownames(dgeGlmTagDisp)[as.logical(deGlm)]
plotSmear(lrt, de.tags = tagsGlm)
```

<br>
RNASeq datasets often contain thousands of genes across multiple samples, making it extremely complex to detect and analyze patterns of gene expression data. In this context, **Principal Component Analysis** (**PCA**) transforms data into a set of uncorrelated variables (i.e., principal components), which is useful for reducing data dimensionality, providing a **blind** and **unbiased assessment** of gene expression patterns across multiple samples and conditions. PC1 describes the most variation within the data, PC2 the second most, and so forth.

Therefore, **PCA** is highly valuable for **(i)** identifying **gene expression trends** (e.g., confirming the biological relevance of the treatment/disease or experimental condition), **(ii)** identifying **clusters** of samples and their relationships, and **(iii)** screening out any potential **outliers** or **batch effects**. Visualizing these effects assists in confirming the hypothesis tested in further **DEGs** downstream analysis.


To create a **PCA plot**, first a **vst()** (*i.e., variance-stabilizing transformation*) function should be applied to remove the dependence of the variance on the mean. This is because analysis of multi-dimensional data (such as PCA) works best for homoskedastic data (i.e., consistent variability), whereas in RNASeq data variance increases with the mean values. Unlike the other DESeq2 function (i.e., *rlog()*) function for this purpose, **vst()** is more sensitive to size factor and also normalizes with respect to library size.


```{r 2D PCA}
#Output 2D PCA Plot
vsdata <- vst(ddsCPM, blind = FALSE) #Apply variance-stabilizing transformation to DESeq object 'ddsCPM'
#By default both these functions are **blind** to the sample design formula given to DESeq2 in DESeqDataSetFromMatrix(), however this is not appropriate if one expects large differences in counts explained by differences in experimental design. In such cases, the blind parameter should be set to 'FALSE' as recommended in DESeq2 vignette. 


#plotPCA(vsdata, intgroup = "treatment")          #Preview PC1 and PC2 % for correct axis labeling.
PCAraw <- plotPCA(vsdata, intgroup = "treatment") #Create a PCA plot using 'vsdata' and groups it by 'treatment'.
group.colors <- c("#619CFF","#00C19F")            #Vector containing color codes to be used in each group.
new_pca <- ggplot(PCAraw[["data"]],               #Create a ggplot using 'data' data.frame within 'PCAraw' object. 
  aes(PC1, PC2, color=group)) +                   #Construct aesthetic mapping using PC1 and PC2.
  scale_color_manual(values = group.colors) +     #Specify a vector of colors.
  geom_point(size = 2.5) +                        #Change marker size.
  xlab("PC1 (79%)") +                             #Text used for X axis.
  ylab("PC2 (9%)") +                              #Text used for Y axis.
  ylim(-10, 10) +                                 #Lower/upper limits for Y axis.
  xlim(-12, 12) +                                 #Lower/upper limits for X axis.
  theme_grey() +                                  #Set a grey background with white gridlines.
theme(axis.title.x = element_text(size = 14),     #Customize size and style of axis titles and text.
  axis.title.y = element_text(size = 14),
  axis.text.x = element_text(size = 12, face = "bold"),
  axis.text.y = element_text(size = 12, face = "bold"))

new_pca
ggsave("2D_PCA.svg", new_pca, dpi = 600, width = 2.6, height = 2, units = "in", scale = 2) #Exports plot in .svg
```
<br>
Apart from **PCA**, another important analysis to perform is to determine whether **intergroup variability** - which represents differences between experimental conditions in comparison with control conditions - is greater than the **intragroup variability**, representing technical or biological variance. This is achieved by **calculating distance** as represented by correlation across samples, typically using the **Pearson's correlation plot**. 

**Pearson's** coefficient reflects the linear relationship between 2 variables accounting for differences in mean and standard deviation. For instance, the **more similar the gene expression profiles for all transcripts** are between two samples, the **higher the correlation coefficient** will be. These coefficients are calculated between all samples and are then **visualized as a heatmap**.

Similar to PCA, the **Pearson's correlation plot** provides an overview of the **overall strength of the biological effect** of the experiment, assessing whether replicates and different conditions group together, as well as identify outliers that were not excluded during upstream steps (e.g., sample labeling or genomic alignment).



```{r 2D Sample Distances Correlation}
#Output 2D Pearson's Correlation Plot
vsdata <- vst(ddsCPM, blind = FALSE)                    #Apply variance-stabilizing transformation.
sampleDists <- cor(method = "pearson", assay(vsdata))   #Calculate Pearson correlation coefficients.
sampleDistMatrix <- as.matrix(sampleDists)              #Convert to matrix format.
rownames(sampleDistMatrix) <- paste(vsdata$id)          #Label Sample names.
colnames(sampleDistMatrix) <- paste(vsdata$id)          #Label Sample names.
colors <- colorRampPalette(brewer.pal(9, "Blues"))(255) #Color palette range for heatmap visualization.
SampleDistance = pheatmap(sampleDistMatrix, col=colors) #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
SampleDistance #ggsave("SD_2D_Pearson.svg", SampleDistance, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2) 
```

```{r Euclidean Distances Correlation DESeq2 vignette}
#Alternative: Euclidean Distances Correlation Plot, as described in DESeq2 vignette
sampleDists_Euclidean <- dist(t(assay(vsdata)))                 #Calculate Pearson correlation coefficients.
sampleDistMatrix_Euclidean <- as.matrix(sampleDists_Euclidean)  #Convert to matrix format.
rownames(sampleDistMatrix_Euclidean) <- paste(vsdata$id)        #Label Sample names.
colnames(sampleDistMatrix_Euclidean) <- paste(vsdata$id)        #Label Sample names.
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)    #Color palette range for heatmap visualization.
SampleDistancevignette = pheatmap(sampleDistMatrix_Euclidean,   #Generate heatmap with 'sampleDistMatrix' and 'colors' inputs.
         clustering_distance_rows=sampleDists_Euclidean,        #Clustering by Euclidean distance.
         clustering_distance_cols=sampleDists_Euclidean,        #Clustering by Euclidean distance.
         col=colors)
SampleDistancevignette #ggsave("2D_SD.svg", SampleDistancevignette, dpi = 600, width = 2.55, height = 2, units = "in", scale = 2)
```
<br>
<br>
Next, a **Volcano plot** can be assembled to provide a general overview of the gene expression profiles of 2D Native hASCs vs 2D Glycoengineered hASCs. **Volcano plots** provide a clear and intuitive visualization of **statistical significance** and the **extent of fold changes** for every gene detected. Here, each gene is plotted with their statistical significance (adjusted p-values on the y-axis) agaisnt the fold change (log2 fold change on the x-axis). 

Then, a **threshold** is set to highlight genes that are differentially expressed (**DEGs**), as represented in the plot in the form of dashed lines. As aforementioned, this work considers as **DEGs** genes that present an **adjusted p-value < 0.05** and **log2(fold change) > 1** or **< -1**. This **combination of significance and fold change** information allows researchers to efficiently **identify** and **prioritize** **genes of interest** for further investigation. In addition, **Volcano plots** can also assist in screening out genes with high statistical significance but low fold changes, which may be significant due to large sample sizes but may not have substantial biological relevance.

```{r 2D Volcano plot, warning=FALSE, message=FALSE}
#Output 2D Volcano plot with specific Genes of Interest labelled

resddsCPM.df <- as.data.frame(resddsCPM)                        #Convert DESeq2 results object to a data frame.

#Create custom key-value pairs for 'Up', 'Down', 'Non-DEG' expression. Achieved using nested 'ifelse' statements.
keyvals <-  ifelse(resddsCPM$padj > 0.05, 'grey30',             #If padj > 0.05 assign grey color.
            ifelse(resddsCPM$log2FoldChange < -1, 'royalblue',  #If log2foldchange < -1, assign blue color.
            ifelse(resddsCPM$log2FoldChange > 1, 'firebrick3',  #If log2foldchange > 1, assign red color.
                                                  'grey30')))   #Other values assigned grey color.
#Rename elements in 'keyvals' vector:
names(keyvals)[keyvals == 'firebrick3'] <- 'Up'                 #Assign 'Up' to elements in red.
names(keyvals)[keyvals == 'grey30'] <- 'Non-DEG'                #Assign 'Non-DEG' to elements in red.
names(keyvals)[keyvals == 'royalblue'] <- 'Down'                #Assign 'Up' to elements in red.

#Define vector containing Gene Symbols to highlight in plot
selected = c("PTGS2", "KRT19", "CXCL1", "SLC7A11", "NDNF", "PGD", "VCAM1", "GPC1", "ANGPT1", "GDF15")

#Volcano plot settings
enhvolc <- EnhancedVolcano(resddsCPM.df,                        #Create Volcano plot using 'resddsCPM.df'
        x = "log2FoldChange",                                   #Column name containing log2(fold change) values.
        y = "padj",                                             #Column name containing adjusted p-value values.
        #lab = NA,                                              #Uncomment this function for unlabeled genes.
        pCutoff = 0.05,                                         #Cut-off threshold for statistical significance.
        FCcutoff = 1,                                           #Cut-off threshold for absolute log2(fold change).
        title = NULL,                                           #Plot title.
        subtitle = NULL,                                        #Plot subtitle.
        caption = NULL,                                         #Plot caption.
        xlab = bquote(~Log[2]~"(Fold Change)"),                 #Label for x-axis. 
        ylab = bquote(~-Log[10] ~ "(" * italic(P[adj]) * ")"),  #Label for y-axis.
        colAlpha = 0.5,                                         #Alpha for color transparency.
        colCustom = keyvals,                                    #Override color scheme (e.g. color by pathway, cell-type or condition).
        pointSize = 1,                                          #Size of plotted points.
        labSize = 4,                                            #Size of labels for each point.
        drawConnectors = TRUE,                                  #Connect labels to each corresponding point by line connectors.
        widthConnectors = 0.65,                                 #Line width of connectors.
        arrowheads = FALSE,                                     #Draw arrowheads instead of simple line connectors.
        boxedLabels = TRUE,                                     #Draw labels within boxes.
        directionConnectors = 'both',                           #Direction to draw connectors (e.g., 'y', 'x' or 'both').
        xlim = c(-6, 6),                                        #Limits of x-axis.
        ylim = c(0, -log10(10e-255)),                           #Limits of y-axis.
#Exclude these if not highlighting specific genes
        lab = resddsCPM.df$geneID,                              #Input Gene Symbol column of data.frame object.
        selectLab = selected                                    #Vector containing a subset of genes of interest.
        )

#Customize axis limits and breaks
p1 <-  enhvolc +
    ggplot2::coord_cartesian(xlim = c(-6.4, 6.4),               #Zoom plot to the specific x-axis limit. 
                             ylim = c(0, 250)) +                #Zoom plot to the specific y-axis limit. 
    ggplot2::scale_x_continuous(breaks=seq(-6, 6, 2)) +         #Set spacing of breaks in plot x-axis.
    ggplot2::scale_y_continuous(breaks=seq(0, 250, 50))         #Set spacing of breaks in plot y-axis.
p1
ggsave("2D_Volcano_Wide_Label.svg", p1, width = 4.5, height = 2.8, units = "in", scale = 2)
```
<br>
Alternatively, **Volcano plots** can be customized via other packages (e.g., *'ggplot2'*) as subplot overlays for presenting a list of target genes with different shading specifications.

For instance, **overlaying** all genes that **match a specific requirement** (i.e., all ILs genes).

```{r ggplot2 Volcano, include=FALSE, echo=FALSE}
#Alternative Volcano plot flexibility through 'ggplot2' package. Note: doesn't have EnhancedVolcano flexibility for correcting p-values = 0.

resddsCPMVOLC.df <- resddsCPM.df
#Add a column to the data frame to specify if genes are UP- or DOWN- regulated DEGs
resddsCPMVOLC.df$diffexpressed <- "Non-DEG"
#UPREGULATED DEGs
resddsCPMVOLC.df$diffexpressed[resddsCPMVOLC.df$log2FoldChange > 1 & resddsCPMVOLC.df$padj < 0.05] <- "Upregulated"
#DOWNREGULATED DEGs
resddsCPMVOLC.df$diffexpressed[resddsCPMVOLC.df$log2FoldChange < -1 & resddsCPMVOLC.df$padj < 0.05] <- "Downregulated"

#Volcano 'ggplot2' theme
volc_theme <- theme_set(theme_classic(base_size = 20) +
            theme(
              axis.title.y = element_text(face = "bold", margin = margin(0,10,0,0), size = 15, color = 'black'),
              axis.title.x = element_text(hjust = 0.5, face = "bold", margin = margin(10,0,0,0), size = 15, color = 'black'),
              plot.title = element_text(hjust = 0.5)
            ))

ggplot(data = resddsCPMVOLC.df,
       aes(x = log2FoldChange,
           y = -log10(padj),
       col = diffexpressed)) +
       geom_point(size = 1) + 
       geom_vline(xintercept = c(-1, 1), col = 'grey30', linetype = 'dashed') +
       geom_hline(yintercept = c(-log10(0.05)), col = 'grey30', linetype = 'dashed') +
       scale_color_manual(values = c("royalblue", "grey30", "firebrick3"), #Set color vectors for gene classifications
           labels = c("Downregulated", "Non-DEGs", "Upregulated")) + #Set label names to overwrite text in data.frame (UP, DOWN, NO)
       coord_cartesian(ylim = c(0, 300), xlim = c(-11, 11)) + #Set limits due to padj = 0 on some genes, they will still be plotted but collapsed on edge
       labs(color = 'Legend', #Legend title, 
           x = expression("Log"[2]*"(Fold Change)"),
           y = expression("-Log"[10]*P[adj])) + 
           scale_x_continuous(breaks = c(seq(-10, 10, 5)), #Customize x-axis ticks
           limits = c(-20, 20)) + #Show datapoints within this region
volc_theme
```

```{r ggplot2 Volcano settings, include=FALSE}
resddsCPMVOLC.df %>%
  distinct(diffexpressed) %>%
  pull()  
# "NO"  "UP"  "DOWN"

cols <- c("Downregulated" = "royalblue", "Upregulated" = "firebrick3", "Non-DEG" = "grey30") 
sizes <- c("Downregulated" = 1, "Upregulated" = 1, "Non-DEG" = 1) 
alphas <- c("Downregulated" = 0.5, "Upregulated" = 0.5, "Non-DEG" = 0.5)
# "#ffad73", "down" = "#26b3ff"

ggplotVolc <- ggplot(data = resddsCPMVOLC.df, aes(x = log2FoldChange, y = -log10(padj),
  fill = diffexpressed,
  size = diffexpressed,
  alpha = diffexpressed)) +
  geom_point(shape = 21,
             colour = "black") + 
  geom_vline(xintercept = c(-1, 1), col = 'grey30', linetype = 'dashed') +
  geom_hline(yintercept = c(-log10(0.05)), col = 'grey30', linetype = 'dashed') +
  scale_fill_manual(values = cols) + # Modify point color
  scale_size_manual(values = sizes) + # Modify point size
  scale_alpha_manual(values = alphas) + # Modify point transparency
  scale_x_continuous(breaks = c(seq(-10, 10, 5)),       
                     limits = c(-10, 10)) +
  labs(color = 'Legend', #legend_title, 
       x = expression("log"[2]*"Fold Change"), y = expression("-log"[10]*P[adj]))

ggplotVolc
#ggsave("2Dvs3D_ggplotVolc.svg", ggplotVolc, width = 3.8, height = 2.5, units = "in", scale = 2)
```

```{r GGPLOT Volcano Overlay and Specific Labeling 2, echo=FALSE}
#--------------OVERLAY TARGET GENES ON GREYED OUT VOLCANO PLOTS-----------------------
# Add subplot layer to the main volcano plot -----------------------------------
ils <- str_subset(resddsCPMVOLC.df$geneID, "^IL") #"^(I|L){2}[0-9]+$")  #This fetches gene symbols starting with IL in the name.
#length(ils) #52

il_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% ils)

ggplotVolc_sublayer <- ggplot(data = resddsCPMVOLC.df, # Original data  
       aes(x = log2FoldChange, y = -log10(padj))) + 
  geom_point(colour = "grey", alpha = 0.5) +
  geom_point(data = il_genes, # New layer containing data subset il_genes       
             size = 2,
             shape = 21,
             fill = "firebrick",
             colour = "black") +
  geom_vline(xintercept = c(-1, 1), col = 'grey30', linetype = 'dashed') +
  geom_hline(yintercept = c(-log10(0.05)), col = 'grey30', linetype = 'dashed') +
  labs(color = 'Legend', #legend_title, 
       x = expression("log"[2]*"(Fold Change)"), y = expression("-log"[10]*P[adj])) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 5)), #Customize X axis ticks
                     limits = c(-10, 10)) + # Show datapoints within this region
  volc_theme

ggplotVolc_sublayer
#ggsave("2D_ggplotVolc_sub.svg", ggplotVolc_sublayer, width = 3.8, height = 3, units = "in", scale = 2)
```
<br>
In addition, it is also possible to assemble a **Volcano plot** containing an **overlay** of genes of interest on top.

![](Treated/2Dvs3D_ggplotVolc_labels.svg)

```{r GGPLOT Volcano Overlay and Specific Labeling 3, include=FALSE}
#-----------------OVERLAY LABELED GENES------------------------------------
Selected_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% c("IL32", "IL34", "IL1B", "COL7A1", "COL10A1", "MMP13", "LAMC2", "LAMB3", "COL24A1", "LAMA1", "VIT"))
UP_Selected_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% c("IL32", "IL34", "IL1B", "COL7A1", "COL10A1", "MMP13", "LAMC2", "LAMB3"))
DOWN_Selected_genes <- resddsCPMVOLC.df %>%
  filter(geneID %in% c("COL24A1", "LAMA1", "VIT"))

ggplotVolc_labels <- ggplot(data = resddsCPMVOLC.df,
       aes(x = log2FoldChange, y = -log10(padj))) + 
       geom_point(aes(colour = diffexpressed), 
             alpha = 0.2, 
             shape = 16,
             size = 1) + 
  geom_point(data = UP_Selected_genes,
             shape = 21,
             size = 2, 
             fill = "steelblue", 
             colour = "black") + 
  geom_point(data = DOWN_Selected_genes,
             shape = 21,
             size = 2, 
             fill = "firebrick3", 
             colour = "black") + 
   geom_hline(yintercept = -log10(0.05),
             linetype = "dashed") + 
  geom_vline(xintercept = c(log2(0.5), log2(2)),
             linetype = "dashed") +
  geom_label_repel(data = Selected_genes, # Add labels last to appear as the top layer  
                   aes(label = geneID),
                   force = 2,
                   nudge_y = 3,
                   #direction = "y",
                   size = 2,
                   min.segment.length = unit(0, "lines"),
                   label.padding = unit(0.2, "lines")) +
  scale_colour_manual(values = cols) + 
  scale_x_continuous(breaks = c(seq(-10, 10, 2)),     
                     limits = c(-10, 10)) +
  volc_theme

ggplotVolc_labels
#ggsave("2Dvs3D_ggplotVolc_labels.svg", ggplotVolc_labels, width = 3.8, height = 2.5, units = "in", scale = 2)
```

<br>
Alternatively, it is possible to assemble a **MA plot**, which conveys a similar overview to **Volcano plots**. However, this plot lacks the design flexibility that Volcano plots offer through their dedicated R packages.

```{r Alternative 2D MA PLOT}
#Alternative to Volcano - MA Plot. Less flexibility in customizing the final figure.
MAPlot_dds <- DESeq2::plotMA(resddsCPM, alpha = 0.05, #Set p-value cut-off threshold.
               colNonSig = "gray30",                  #Set color code for Non-DEG.
               colSig = "royalblue3",                 #Set color code for statistically significant.
               ylim = c(-7, 7))                       #Set y-axis limit.
abline(h = c(1, -1), lty = 2)                         #Draw horizontal lines at the specified y-axis values, line type - dashed.
```
<br>
Apart from **Volcano plots**, **Heatmaps** provide an important visualization tool in RNASeq for detecting gene expression trends and highlighting genes with relevant biological functions. Herein, each sample is assigned to a column, and each row corresponds to a **DEG** with **color-coded expression level**. Expression values are typically normalized and transformed (i.e., **row z-scaling**) to ensure that genes with varying expression levels can be compared easily. **Heatmaps** are usually paired with **hierarchical clustering** algorithms (e.g., Pearson's) that group together genes and samples with similar expression profiles, generating a hierarchical tree-like dendrogram. **Heatmaps** are thus useful in RNASeq downstream analysis for revealing **co-expression patterns** and **biological functions** affected by experimental conditions.


```{r Heatmap setup, message=FALSE, warning=FALSE}
#Transformation of data for Heatmap
vsdata <- vst(ddsCPM, blind = FALSE)        #Apply variance-stabilizing transformation.
vst_mat <- assay(vsdata)                    #Create matrix with transformed gene expression values.
vst_mat <- vst_mat[rownames(sigsALL.df), ]  #Subset matrix to include only DEGS, i.e., rows that match genes listed in 'sigsALL.df'.
Genename <- mapIds(org.Hs.eg.db,            #Map Ensembl ID from matrix to Gene Symbol and store in variable 'Genename'.
               keys = row.names(vst_mat),
              keytype = "ENSEMBL",
              column = "SYMBOL")

#Replace NA gene symbols with ENSEMBL IDs in vst_mat.
#(i) Identify which entries in 'Genename' have missing values (NA).
no_symbol_idx <- is.na(Genename)

#(ii) For 'Genename' entries with NA, replace values with original Ensembl ID extracted from 'vst_mat' row names.
Genename[no_symbol_idx] <- row.names(vst_mat)[no_symbol_idx]

#(iii) Convert 'Genename' vector into a single column data.frame, making it suitable for assignment to row names of 'vst_mat'
ENS_geneIDs <- stack(Genename)       

#(iv) Replace Ensembl ID entries with Gene Symbols when available, and retains original Ensembl ID if NA.
row.names(vst_mat) <- ENS_geneIDs$values                      


#Double-check that no DEGs were lost during assignment
#dim(vst_mat) #568 DEGs -> TRUE

#Scale genes z-score by row, as otherwise large values from highly expressed genes would dominate the plot
vst_mat_z <- t(scale(t(vst_mat)))

#Assign column names to each sample
colnames(vst_mat_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                        #Create segments for the color gradient.
                c("#2166AC", "white", "#B2182B"))   #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                       #Set legend title.
                legend_height = unit(2.5, "cm"),          #Set height of the legend body.
                grid_width = unit(0.4, "cm"),             #Set width of each legend grid.
                title_position = "topcenter",             #Set position of the title.
                gap = unit(4, "mm"))                      #Set gap between heatmap and legend.
```


```{r Output Heatmap, echo=TRUE, results='hide'}
#Output full DEGs Heatmap
ht <- Heatmap(vst_mat_z,                                        #Matrix input for heatmap.
        col = col_fun,                                          #Set vector of colors for interpolation.
        show_column_names = FALSE,                              #Whether to show column labeling. 
        show_row_names = FALSE,                                 #Whether to show row labeling.
        #row_labels = sigs.df[rownames(vst_mat_z),]$Symbol,     #Uncomment to specify row labeling. 
        row_names_gp = gpar(fontsize = 5),                      #Set font size for row labeling.
        clustering_distance_columns = "pearson",                #Set clustering method for column dendrogram.
        clustering_distance_rows = "pearson",                   #Set clustering method for row dendrogram.
        #column_labels = colnames(vst_mat_z),                   #Uncomment for specifying column labeling. 
        column_names_gp = gpar(fontsize = 5),                   #Set font size for column labeling.
        row_dend_reorder = FALSE,                               #Whether to apply row dendrogram reordering. 
        column_dend_reorder = TRUE,                             #Whether to apply column dendrogram reordering. 
        show_row_dend = TRUE,                                   #Whether to show row dendrogram clustering.
        row_dend_width = unit(10, "mm"),                        #Set width of row dendrograms.
        show_column_dend = FALSE,                               #Whether to show column dendrogram clustering.
        width = unit(2, "in"),                                  #Set heatmap width.
        height = unit(10, "in"),                                #Set heatmap height.
        name = "Z-score")                                       #Set title.
#ht
#tiff("2D_Full_Heatmap.tiff", width = 4, height = 10, units = "in", res=600)
#draw(ht)
#dev.off()
```

![](Treated/2D_Full_Heatmap.svg)



<br>
<br>
Next, it becomes highly valuable to zoom in on **specific gene subsets** that are involved in relevant **biological processes**. Herein, DEGs related to **extracellular matrix organization**, **glycoproteins** and **secreted factors** are mapped to unveil the impact of the cell surface engineering process on the transcriptome.


```{r Heatmap Selected Subset of Genes, message=FALSE, warning=FALSE, results='hide', fig.keep='all'}
#Create vectors of gene lists
Genelist_ECM = c("ADAMTS15", "WNT16", "WNT2", "WNT9A", "ANGPTL4", "BMP4", "MMRN2", "COL21A1", "COL22A1", "OMD", "TNXB", "THSD4")
Genelist_Glycoproteins = c("ADAMTS15", "APCDD1L", "ABCA6", "ABCA8", "ABCA9", "ABCB6", "ABCC2", "ABCC3", "ATP6V0E2", "BRINP1", "CEBPB", "CD200", "EPHA5", "FAM20A", "GPR137B", "GPR68", "GPRC5A", "GPRC5B", "KCNJ5-AS1", "KITLG", "KIT", "LIF", "LYPD1", "MPL", "NOX4", "SHANK2", "SLITRK4", "ST6GALNAC5", "ST6GAL2", "B3GNT8", "WNT16", "WNT2", "WNT9A", "ADCY4", "AFP", "MGAT4B", "AREG", "ANGPT1", "ANGPTL4", "AGTR1", "AQP1", "ACKR3", "B3GALT2", "B4GALNT1", "BMP4", "CDH18", "CDH6", "CDH8", "CHST6", "CDON", "CEMIP", "CHRM2", "CSGALNACT1", "COL21A1", "COL22A1", "C1RL", "CFHR1", "CNNM4", "ENPP4", "ELAPOR1", "ESM1", "EDNRB", "EFNA5", "EPOR", "GRPR", "GPC1", "GDF15", "GDF6", "HHIP", "HGF", "HMMR", "HSD11B1", "IGSF10", "IGFBP3", "ICAM1", "IFNE", "IL1RL1", "IL20RA", "IL34", "JUP", "LRRN3", "LIPG", "ME1", "MXRA5", "MMRN2", "NDNF", "NETO2", "NOTCH4", "OMD", "PLAT", "PDGFRL", "PLXDC1", "GALNT15", "GALNT3", "POPDC3", "KCNK1", "PCSK6", "PCSK9", "PTGDR", "PTGER3", "PTGS2", "PTPRQ", "PCDHB5", "RAET1E", "RNF128", "RNF150", "SPP1", "SEMA3C", "SEMA3D", "SEMA6D", "PRSS35", "SNAI1", "SLC1A2", "SLC12A9", "SLC14A1", "SLC2A12", "SLC2A5", "SLC24A3", "SLC38A1", "SLC38A4", "SLC4A8", "SLC7A11", "SLC7A14", "TNXB", "TENM4", "THBD", "TRPC6", "TMEM130", "VCAM1")
Genelist_Secreted = c("ADAMTS15", "ABCB6", "CCL26", "CXCL1", "CXCL2", "CXCL3", "CXCL6", "CXCL8", "C1QTNF2", "FAM20A", "HTRA3", "KCNJ5-AS1", "KITLG", "LIF", "WNT16", "WNT2", "WNT9A", "AKR1B10", "AFP", "ANGPT1", "ANGPTL4", "BMP4", "CEMIP", "COL21A1", "COL22A1", "C1RL", "CFHR1", "ESM1", "EDN1", "EPOR", "FABP5", "GPC1", "GDF15", "GDF6", "HHIP", "IGSF10", "IGFBP3", "IFNE", "IL1B", "IL1RL1", "IL34", "LIPG", "MXRA5", "MMRN2", "NDNF", "OMD", "PINLYP", "PLAT", "PDGFRL", "PLXDC1", "PCSK6", "PCSK9", "RAET1E", "SPP1", "SCG5", "SEMA3C", "SEMA3D", "PRSS35", "TNXB", "THSD4")

#Replace here [Genelist_ECM] with the name of the vector for specific genes c("XX", "XY", "XZ")
selected_DEGs <- Genelist_Secreted

#Filters the matrix to retain only the rows that correspond to the genes listed in 'selected_DEGs' vector.
vst_matSelected <- vst_mat[rownames(vst_mat) %in% selected_DEGs, ]

#Double check to ensure that all selected DEGs were retrieved
#dim(vst_mat)                    #568 DEGs
#dim(vst_matSelected)            #60 DEGs
#length(Genelist_Secreted)       #60 DEGs

#Apply row z-scaling
vst_matSelected_z <- t(scale(t(vst_matSelected)))

#Assign column names to each sample
colnames(vst_matSelected_z) <- c("2DC1", "2DC2", "2DC3", "2DC4", "2DC5", "2DG1", "2DG2", "2DG3", "2DG4", "2DG5")

#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                        #Create segments for the color gradient.
                      c("#2166AC", "white", "#B2182B"))   #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                       #Set legend title.
                title_gp = gpar(fontsize = 12),           #Set font size for legend title.
                legend_height = unit(2.5, "cm"),          #Set height of the legend body.
                legend_width = unit(5, "cm"),             #Set width of the legend body.
                grid_width = unit(0.4, "cm"),             #Set width of each legend grid.
                direction = "horizontal",                 #Whether scale is 'horizontal' or 'vertical'
                title_position = "topcenter",             #Set position of the title.
                gap = unit(4, "mm"))                      #Set gap between heatmap and legend.


ht_selected <- Heatmap(vst_matSelected_z,                 #Matrix input for heatmap.
        col = col_fun,                                    #Vector of colors for interpolation.
        rect_gp = gpar(col = "grey30", lwd = 0.15),       #Border line color and width of each cell.
        show_column_names = FALSE,                        #Whether to show column labeling.
        show_row_names = TRUE,                            #Whether to show row labeling.
        row_names_gp = gpar(fontsize = 5),                #Set font size for row labeling.
        clustering_distance_columns = "pearson",          #Set clustering method for column dendrogram.
        clustering_distance_rows = "pearson",             #Set clustering method for row dendrogram.
        #column_labels = colnames(vst_matSelected_z),     #Uncomment to specify column labeling.
        column_names_gp = gpar(fontsize = 5),             #Set font size for column labeling.
        row_dend_reorder = FALSE,                         #Whether to apply row dendrogram reordering.
        column_dend_reorder = TRUE,                       #Whether to apply column dendrogram reordering. 
        show_row_dend = FALSE,                            #Whether to show row dendrogram clustering.
        row_dend_width = unit(10, "mm"),                  #Set width of row dendrograms.
        show_column_dend = FALSE,                         #Whether to show column dendrogram clustering.
        width = unit(3, "in"),                            #Set heatmap width.
        height = unit(3.5, "in"),                         #Set heatmap height.
        heatmap_legend_param = lgd_list,                  #Reference object for legend parameters.
        #show_heatmap_legend = FALSE,                     #Uncomment to hide heatmap legend.
        name = "Z-score")                                 #Set title.

ht_selected = draw(ht_selected, heatmap_legend_side = "bottom") #Draws heatmap with legend at the bottom position.

#Output settings for each heatmap (W = width, H = height)
#2D_ECM: W = 3 x H = 3.5 | Output svg: W = 5 x H = 5
#2D_Secreted: W = 3 x H = 6.5 | Output svg: W = 4 x H = 4.5
#2D_Glycoproteins: W = 3 x H = 3.5 | Output svg: W = 4 x H = 4.5

svg(filename = "2D_Heatmap_Secreted.svg", width = 4, height = 4.5)
draw(ht_selected, heatmap_legend_side = "bottom")
dev.off()
```


```{r rowAnnotation to be used within right_annotation not working yet, include=FALSE}
#ha = rowAnnotation(foo = anno_mark(at = mark_at, labels = Genes_DoubleCheck,
                  #link_width = unit(5, "mm"),
                  #padding = 0.5,
#                  labels_gp = gpar(fontsize = 10
                     #, fontface = "bold"  
#                                   )))
#ha = rowAnnotation(foo = anno_mark(at = which(rownames(vst_matSelected_z) %in% Genes_Label),
#                                    labels = rownames(vst_matSelected_z)[rownames(vst_matSelected_z)%in%Genes_Label],
#                                   which = "row"))
```

<br>
In large **Heatmaps**, in order to avoid visual cluttering by labeling every gene, it is often valuable to label only a subset of genes.

```{r Heatmap Specific Gene Labeling, message=FALSE, warning=FALSE, results='hide', fig.keep='last'}
#Heatmap parameters
col_fun <- colorRamp2(c(-2, 0, 2),                            #Create segments for the color gradient.
                      c("#2166AC", "white", "#B2182B"))       #Create color gradient with the specified color codes.

lgd_list <- list(title = "Z-score",                           #Set legend title.
                title_gp = gpar(fontsize = 12),               #Set font size for legend title.
                legend_height = unit(2.5, "cm"),              #Set height of the legend body.
                legend_width = unit(5, "cm"),                 #Set width of the legend body.
                grid_width = unit(0.4, "cm"),                 #Set width of each legend grid.
                direction = "horizontal",                     #Whether scale is 'horizontal' or 'vertical'
                title_position = "topcenter",                 #Set position of the title.
                gap = unit(4, "mm"))                          #Set gap between heatmap and legend.


#Create vector with subset of genes to label
Glycoproteins_select_label = c("ANGPT1", "B3GNT8", "MGAT4B", "EDNRB", "NDNF", "ADAMTS15", "ST6GALNAC5", "BMP4", "CDH8", "COL21A1", "IL34", "ICAM1", "NOTCH4", "OMD", "SEMA3D", "VCAM1")
Secreted_select_label = c("WNT2", "WNT16", "CXCL1", "CXCL2", "BMP4", "COL22A1", "EDN1", "GDF6", "IL1B", "CEMIP", "LIPG", "SCG5", "MXRA5", "IFNE", "SPP1", "ANGPT1", "NDNF", "KITLG", "AKR1B10")

#Assign genes subset to a static object for easily switching between plots
Genes_Label = Secreted_select_label

#Identify indices of the row names in the matrix that correspond to the genes listed in the vector 'Genes_Label' 
mark_at = which(rownames(vst_matSelected_z) %in% Genes_Label)

#Extract the row names of the matrix that correspond to indices in 'mark_at'. This step is a double-check to ensure that the selected rows do indeed match the original 'Genes_Label' vector.
Genes_DoubleCheck <- rownames(vst_matSelected_z)[mark_at]

#Check whether the two vectors contain the same elements, regardless of their order.
setequal(Genes_DoubleCheck, Genes_Label) #TRUE -> proceed to labeling.

#Note: Use 'Genes_DoubleCheck' object for labeling, because it outputs Gene symbols in the same order as 'mark_at'. This is necessary because 'mark_at' extracts row name numbers, which are automatically ordered numerically in the output vector.

#'anno_mark' function in ComplexHeatmap package enables linking heatmap annotation with customized labels.
foo = anno_mark(at = mark_at,                                 #Numeric index from original matrix.
                #link_width = unit(5, "mm"),                  #Uncomment to specify label link width.
                padding = 0.75,                               #Padding space between neighboring labels.
                labels = Genes_DoubleCheck,                   #Corresponding labels.
                labels_gp = gpar(fontsize = 10),              #Sets font size.
                which = "row")                                #Whether 'column' or 'row' annotation.

ht_selected_label <- Heatmap(vst_matSelected_z,               #Matrix input for heatmap.
                 col = col_fun,                               #Vector of colors for interpolation.
                 rect_gp = gpar(col = "grey30", lwd = 0.15),  #Border line color and width of each cell.
                 show_column_names = FALSE,                   #Whether to show column labeling.
                 show_row_names = FALSE,                      #Whether to show row labeling.
                 #row_names_gp = gpar(fontsize = 4),          #Uncomment to specify row label font size.
                 #row_names_side = "left",                    #Uncomment to specify row label side.
                 clustering_distance_columns = "pearson",     #Set clustering method for column dendrogram.
                 clustering_distance_rows = "pearson",        #Set clustering method for row dendrogram.
                 row_dend_reorder = FALSE,                    #Whether to apply row dendrogram reordering.
                 column_dend_reorder = TRUE,                  #Whether to apply column dendrogram reordering.
                 show_row_dend = FALSE,                       #Whether to show row dendrogram clustering.
                 show_column_dend = FALSE,                    #Whether to show column dendrogram clustering.
                 width = unit(3, "in"),                       #Set heatmap width.
                 height = unit(3.5, "in"),                    #Set heatmap height.
                 heatmap_legend_param = lgd_list,             #Reference object for legend parameters.
                 name = "Z-score")                            #Set title.

ht_merge = ht_selected_label + rowAnnotation(mark = foo)
ht_merge = draw(ht_merge, heatmap_legend_side = "bottom", align_heatmap_legend = "heatmap_left") #Note: 'align_heatmap_legend' = 'heatmap_center' position is offset because heatmap has been merged with 'rowAnnotation'.
svg(filename = "2D_Heatmap_Secreted_CompactLabel.svg", width = 4.5, height = 4.5)
dev.off()
```





